This document gathers and illustrates work on a set of tools for analyzing Twitter communities and their interaction, built with Hadoop and R. The report is auto-generated by scripts that first aggregate data in Hadoop and then invoke R tools that compile the markdown for this document, incorporating statistics, figures and tables generated from the data. The scripts are here, and the markdown source for this report here.
Data is gathered by using Twitter’s public streaming API endpoint to extract tweets related to Spanish political parties. A Flume agent with a Twitter4j source funnels data received from Twitter’s API onto disk. The custom Flume source can be configured to follow a number of users, track a number of phrases, and filter tweets by language. An example configuration of the agent can be found here. Note that Twitter account IDs (rather than names) have to be used to follow users.
The current strategy collects all tweets that originate from the official accounts of the major political parties and of their official leaders and spokespersons, as well as all tweets that mention or retweet any of these accounts. In addition, a number of keywords related to the elections are tracked. The result is filtered to only include tweets in Spanish.
The resulting tweets are stored on a distributed file system (HDFS) as raw JSON files. They are arranged in daily folders, with individual files containing a roughly equal number of tweets. Hive provides an SQL-like view on these JSON files, with the Hive tables also partitioned by day. Various HQL scripts export daily edge lists for mentions and retweets, along with other aggregated summaries, into local text files.
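To make the edge-list export more concrete, here is a minimal Python sketch of the same idea. It is not the HQL used for this report. It assumes the raw files hold one tweet per line in the JSON layout of Twitter's v1.1 API (`user.screen_name`, `retweeted_status`, `entities.user_mentions`), and the function name `extract_edges` is hypothetical.

```python
import json
from collections import Counter


def extract_edges(raw_lines):
    """Count (source, target) pairs for retweets and mentions.

    Assumes one JSON-encoded tweet per line, using Twitter v1.1 field names.
    """
    retweets, mentions = Counter(), Counter()
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        source = tweet["user"]["screen_name"]
        # A retweet embeds the original tweet under "retweeted_status".
        original = tweet.get("retweeted_status")
        if original is not None:
            retweets[(source, original["user"]["screen_name"])] += 1
        # Mentioned accounts are listed under entities.user_mentions.
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            mentions[(source, mention["screen_name"])] += 1
    return retweets, mentions


# Two made-up tweets: a retweet of bob, and a plain tweet mentioning carol and bob.
sample = [
    json.dumps({
        "user": {"screen_name": "alice"},
        "retweeted_status": {"user": {"screen_name": "bob"}},
        "entities": {"user_mentions": [{"screen_name": "bob"}]},
    }),
    json.dumps({
        "user": {"screen_name": "alice"},
        "entities": {"user_mentions": [{"screen_name": "carol"},
                                       {"screen_name": "bob"}]},
    }),
]
retweet_edges, mention_edges = extract_edges(sample)
```

Running this once per daily folder would yield one retweet and one mention edge list per day, mirroring the daily partitioning described above.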
A number of different graphs and tables can then be generated from the local data and inspected using the R tools. Graphs are distinguished by layer (retweets, mentions, or both combined) and version (e.g. one for the period of the Catalan elections and one for the general elections). Graphs are also cached locally as R data files, so they don’t have to be re-created for each analysis (unless new data needs to be incorporated).
Each graph layer consists of vertices representing Twitter accounts. Edges between those vertices capture the number of times a user A has retweeted another user B (in the retweet layer R), or how many times A has ‘mentioned’ B (in the mention layer M). First we will look at some overall statistics for the different graph layers.
In the following sections we’ll use tweets from the Catalan election period as an example.
We start off by looking at the volume and frequency of tweets using an HQL script that aggregates the number of tweets per hour. In total, for the catalunya version of the graph, we have 3.228.346 tweets in the period between 04-08-2015 and 07-10-2015 (65 days).
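The hourly aggregation and the length of the period can be sketched in a few lines of Python. This is an illustration rather than the report's HQL script. It assumes that timestamps follow the `created_at` layout of Twitter's v1.1 JSON and that the dates above are written day-month-year; the helper name `tweets_per_hour` is hypothetical.

```python
from collections import Counter
from datetime import date, datetime

# Assumed layout of the "created_at" field in Twitter's v1.1 JSON.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_hour(created_at_values):
    """Count tweets per (UTC) hour bucket, keyed as 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for value in created_at_values:
        timestamp = datetime.strptime(value, CREATED_AT_FORMAT)
        counts[timestamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts


# Three made-up timestamps: two fall in the 20:00 bucket, one in 21:00.
hourly = tweets_per_hour([
    "Sun Sep 27 20:15:03 +0000 2015",
    "Sun Sep 27 20:59:59 +0000 2015",
    "Sun Sep 27 21:00:00 +0000 2015",
])

# Inclusive day count between 04-08-2015 and 07-10-2015 (read as dd-mm-yyyy).
period_days = (date(2015, 10, 7) - date(2015, 8, 4)).days + 1
```

Read as 4 August to 7 October 2015 and counted inclusively, the period comes to the 65 days quoted above.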
For a more detailed view we can look at tweet frequencies over time, at different levels of resolution:
On the left, tweet counts are shown aggregated by day for the complete period. On the right, an hourly count is shown for a single week.
The following tables provide general statistics for the three graph layers before and after basic preprocessing. The preprocessing consists of
Lastly, stray nodes resulting from the filtering (those not connected to any other) are removed too.
Note that the total number of tweets from the previous section (3.228.346) is not identical to the sum of retweet and mention edges or tweets. This is because retweets and mentions are not mutually exclusive. A retweet, for example, can mention the retweeted account as well as other accounts. Equally, a tweet may mention one or more accounts without being a retweet. As a result, the number of retweets alone may be smaller than the total number of tweets, as there will likely be tweets with mentions that are not retweets. On the other hand, the number of mentions may actually be greater than the total number of tweets (e.g. if the average number of mentions per tweet is greater than 1).
Also note that for the combined layer, the weights of edges (#tweets) between identical pairs of accounts in the retweet and mention layers are simply added, so the corresponding “tweet” numbers do add up. In contrast, the edges of the combined layer are the union of the edges in both layers, so the numbers of edges do not necessarily add up. One layer may in fact be a subset of the other. For example, in most cases it seems that if one user has retweeted another, then they will also have mentioned that other user. The converse is not generally true: many users seem to mention others without ever retweeting them. As a result, the set of connections formed by mentions may already contain all retweet connections, but not the other way round.
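A small worked example may help here. The Python sketch below uses made-up accounts and weights, and represents each layer as a mapping from (source, target) to the number of tweets; this representation is an assumption for illustration, not the data structure of the R tools.

```python
from collections import Counter

# Made-up layers: edge (source, target) -> number of tweets.
retweet_layer = Counter({("a", "b"): 3, ("c", "b"): 1})
mention_layer = Counter({("a", "b"): 4, ("c", "b"): 1, ("a", "c"): 2})

# Combined layer: weights of identical edges are added,
# and the edge set is the union of both edge sets.
combined = retweet_layer + mention_layer

retweet_tweets = sum(retweet_layer.values())   # total weight of the retweet layer
mention_tweets = sum(mention_layer.values())   # total weight of the mention layer
combined_tweets = sum(combined.values())       # equals the sum of the two above

edges_if_added = len(retweet_layer) + len(mention_layer)  # naive sum of edge counts
combined_edges = len(combined)                            # size of the union

# In this example every retweet edge is also a mention edge.
retweets_subset_of_mentions = set(retweet_layer) <= set(mention_layer)
```

Here the tweet weights add up (4 + 7 = 11), while the edge counts do not (2 + 3 = 5 versus 3 edges in the union), because the retweet edges are a subset of the mention edges.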
Next we can plot the edge weight and node degree distributions of the (preprocessed) retweet layer:
Figure: Log-log plots of network weight and degree distribution.
Both distributions appear to follow a power law, which shows up as a roughly straight line in a log-log plot (as is common in scale-free networks, for example).
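One rough way to check such a claim numerically is to fit a straight line to the distribution on the log-log scale. The Python sketch below does this with ordinary least squares, which is only a quick diagnostic and not a rigorous power-law test; it is not how the figures in this report were produced. The function names and the synthetic data are assumptions for illustration.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each in-degree k to the number of nodes with that in-degree.

    `edges` is an iterable of distinct (source, target) pairs.
    """
    in_degree = Counter(target for _, target in edges)
    return Counter(in_degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(k)."""
    xs = [math.log(k) for k in distribution]
    ys = [math.log(count) for count in distribution.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Tiny example graph: x is retweeted by two users, y by one.
small = degree_distribution([("a", "x"), ("b", "x"), ("a", "y")])

# Synthetic distribution with count(k) = 1024 * k**-2, i.e. an exact power law.
synthetic = {1: 1024, 2: 256, 4: 64, 8: 16}
slope = loglog_slope(synthetic)
```

For the synthetic data the fitted slope is -2, the exponent of the power law it was generated from. Real degree data is noisier, especially in the tail, so the fitted slope should be read as indicative only.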
We can also compute various node-centrality (importance) measures. For example, here are the 5 most central Twitter accounts with respect to in-degree (number of retweeters) and PageRank for the retweet layer:
Note that the in-degree here corresponds to the number of users having retweeted a particular account, not the number of retweets received (which requires taking into account the weight of each connection).
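The distinction between in-degree and weighted in-degree, and the idea behind PageRank, can be sketched as follows. This Python example is illustrative only: the report's values come from the R tools, and whether those use edge weights for PageRank is not stated here, so the weighted variant below is an assumption. The function names, the damping factor of 0.85 and the sample edges are likewise hypothetical.

```python
from collections import defaultdict


def in_degree_and_strength(weighted_edges):
    """Return (in-degree, weighted in-degree) per account.

    in-degree: number of distinct retweeters;
    weighted in-degree: number of retweets received.
    """
    degree = defaultdict(int)
    strength = defaultdict(int)
    for (_, target), weight in weighted_edges.items():
        degree[target] += 1
        strength[target] += weight
    return dict(degree), dict(strength)


def pagerank(weighted_edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration (assumed variant, see lead-in)."""
    nodes = sorted({node for edge in weighted_edges for node in edge})
    out_weight = defaultdict(float)
    for (source, _), weight in weighted_edges.items():
        out_weight[source] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), weight in weighted_edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


# Made-up retweet layer: an edge (A, B) with weight w means A retweeted B w times.
edges = {("a", "b"): 5, ("c", "b"): 1, ("b", "a"): 1}
degree, strength = in_degree_and_strength(edges)
ranks = pagerank(edges)
top_account = max(ranks, key=ranks.get)
```

In this example account `b` has an in-degree of 2 (two distinct retweeters) but a weighted in-degree of 6 (six retweets received), which is exactly the difference the note above points out.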
Plotting a graph with hundreds of thousands of nodes and millions of edges usually results in an unintelligible “hairball”. We may, however, filter the graph down to a more manageable size, for example by looking at a daily snapshot.
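Since the edge lists are exported per day, taking a daily snapshot amounts to keeping only the edges recorded on that day. A minimal Python sketch, assuming hypothetical records of the form (day, source, target, weight) and a hypothetical helper `daily_snapshot`:

```python
from collections import Counter


def daily_snapshot(edge_records, day):
    """Keep only edges recorded on `day`, summing weights per (source, target)."""
    snapshot = Counter()
    for record_day, source, target, weight in edge_records:
        if record_day == day:
            snapshot[(source, target)] += weight
    return snapshot


# Made-up records spanning two days.
records = [
    ("2015-09-26", "a", "b", 2),
    ("2015-09-27", "a", "b", 1),
    ("2015-09-27", "c", "b", 4),
    ("2015-09-27", "a", "b", 3),
]
snapshot = daily_snapshot(records, "2015-09-27")
snapshot_nodes = {node for edge in snapshot for node in edge}
```

Accounts that have no edge on the chosen day never enter the snapshot, so the resulting graph contains no stray nodes.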
The following figure shows the graph for the day of 27-09-2015: